Jamshidpey and Sankoff BMC Bioinformatics 2013, 14(Suppl 1 5):S7 
http://www.biomedcentral.eom/1 471 -2 1 05/1 4/S1 5/S7 



^BMC 



Bioinformatics 



PROCEEDINGS 



Open Access 



Phase change for the accuracy of the median 
value in estimating divergence time 

Arash Jamshidpey*, David Sankoff 

From Eleventh Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Com- 
parative Genomics 
Lyon, France. 17-19 October 2013 



Abstract 

We prove that for general models of random gene-order evolution of k > 3 genomes, as the number of genes n 
goes to oo the median value approximates k times the divergence time if the number of rearrangements is less 
than cn/4 for any c <1. For some c* > 1, if the number of rearrangements is greater than c*n/4, this approximation 
does not hold. 



Introduction 

The iterative improvement of approximate solutions to 
the Steiner tree problem by optimizing one internal ver- 
tex at a time has a substantial history in the "small phy- 
logeny" problem for parsimony-based phylogenetics, 
both at the sequence level [1] and the gene order level 
[2]. It has been generalized to iterative local subtree 
optimization methods such as "tree-window-hill" [3] and 
"disc covering" [4,5]. Here we focus on the "median pro- 
blem" for gene order where we estimate the location of 
a single point (the median) in a metric space given the 
location of the three or more points connected to the 
median by an edge of the tree. Given k > 3 signed gene 
orders Gi, Gk on a single chromosome or several 
chromosomes, and a metric d such as breakpoints [6], 
inversions [7], inversions and translocations [8], or dou- 
ble-cut-and-join [9], find the gene order M such that 



d(Gi,M) is minimized. 



Although it plays a central role in gene order phylo- 
geny, the median suffers from several liabilities. One is 
that it is hard to calculate in most metric spaces. Not 
only is it NP-hard [10], but exhaustive methods are 
costly for most instances, namely unless Gi . . . , Gfc are 
all relatively similar to each other, which we will refer to 
generically as the similar genomes condition. Another 
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problem is that heuristics tend to produce inaccurate 
results unless a suitable similar genomes condition 
holds [11]. Still another, is the tendency in some metric 
spaces to degenerate solutions [12] unless the same con- 
ditions prevails. 

In this paper we add to this litany of difficulties by 
showing that as k genomes evolve over time, as modeled 
by any one of several biologically-motivated random 
walks, there is a phase change after «/4 steps, where n 
is the number of genes. With w < «/4 steps, the sum of 
the normalized distances from each of the gen- 

omes to the starting point - the ancestor - converges to 
ku/n in probability, and this is the median value. When 
u > c*«/4 steps, for a constant c* > 1, the sum of the 
normalized distances to the median converges in prob- 
ability to a value less than ku/n, and that the ancestor is 
no longer the median. 

Our proof is inspired by a result of Berestycki and 
Durrett [13] in showing that the reversal distance 
between two signed permutations converges in probabil- 
ity to the actual number of steps, after rescaling, if and 
only if u < nil. The technique is to construct a graph 
with genes as vertices and edges added between vertices 
according to how they are affected by transpositions. 
Properties of the number of components of random 
Erdos-Renyi graphs can then be invoked to prove the 
result. 
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Definitions 

We represent a unichromosomal genome by a signed 
permutation, where the sign indicates whether the gene 
is "read" from left to right (tail-to-head) or from right to 
left (head to tail) on the chromosome. Let be the 
signed symmetric group of order n, i.e. the space of all 
signed permutations of length n. A reversal operation 
applied to a signed permutation reverses the order, and 
changes the signs, of one or more adjacent terms in the 
permutation. A DCJ operation, which can apply not 
only to signed permutations but to more general gen- 
omes containing linear and circular chromosomes, cuts 
the genome in two places and rejoins pairs of the four 
"loose ends" in one of two possible new ways (one of 
which may be equivalent to a reversal). We define the 
reversal and DCJ distances, d r and dej, to be the mini- 
mum number of reversal and DCJ operations, respec- 
tively, needed to transform one genome to another. 

The breakpoint graph BP(H, IT) of two genomes 
represented by n and IT contains vertices for the head 
and tail of each gene, black edges edges defined by the 
adjoining heads or tails of two adjacent genes in the 
genome II and grey edges defined by two adjacent genes 
in the genome IT'. Let id = I, the identity permutation, 
and BP(U) = BP(n, id). It is well-known that 

dcj(n) = n+ 1 - \cBP{U)\. (1) 

We need to define an orientation for grey (and black) 
edges of BP{Il). We traverse a cycle c e cBP{Tl) in a 
counter-clockwise manner if we start at the left-most ver- 
tex of BP(Yl) (in the usual representation), travel along its 
unique adjacent black edge and end at the same vertex 
through its unique adjacent grey edge. Then we say a 
black edge in c is positively oriented if we move along it 
from left to right in a counter-clockwise traversal. Other- 
wise we say it is negatively oriented. Similarly, for the 
grey edge (i t , [i + l)h) we say it is positively oriented if 
during a counter-clockwise traversal we move along it 
from i t to (i + 1)/,. Otherwise it is negatively oriented. We 
define the orientation function £ on the edges of BP(U) 
to be: 

, , _ J +1 if e is positively oriented . . 

} — 1 if e is negatively oriented. 

We say the black (grey) edges e, e are parallel, denoted 
by e 1 1 e if c(e) = f(e'). Otherwise we say they are crossing. 
This is just a reformulation of Hannenhalli and Pevzner's 
original concept of oriented cycles. An oriented cycle in 
this definition is a cycle including at least one positively 
and one negatively oriented black edge. The mechanism 
by which a reversal affects a genome can easily be seen 
using the BP graph. Let p be a reversal acting on two 
black edges e, e in BP{T\). If they are in two different 



cycles we have a merger of the two to construct a new 
cycle. But if e, e are in a same cycle, that cycle either splits, 
if e jf e, or does not split if e 1 1 e . 

Limit Behavior of the Median Value 

Suppose d" be a metric on the space of all signed per- 
mutations length n. For a set A of these permutations, 
define 

gl'":S±^N 0 =NU{0}, (3) 

yeA 

Then let 

m in {A) := min{gf n {x) :xeS±|. (5) 

m d ' n {A) is called the median value of A under the 
metric d". A signed permutation which makes g^ n mini- 
mum is called a median solution of A. Denote by d r and 
dej the reversal and DCJ distances on S*. 

Let X 0 = id, the identity permutation, and let X" be a 
stochastic process on S*, where at random Poisson 
times t K , with rate 1, we choose two elements of X^, 
namely i, j and let p{i, j) operate on X£ , that is 

X n zt =Xlop{i,j), (6) 

where p(i, j) is the reversal acting on i and We call X" 
a reversal random walk (r.w.) on. S„. Suppose 
X\' n , . . . , x\' n be k independent reversal r.w. all starting 
at the identity element, id. Define 

A^:={Xl' n 4'"} ( y ) 

and 

et n :=giyd)-m d »[A t ). (8) 

We investigate the time up to which the median value 
of x]' n , . . . , Xf' n , namely m d '" (A t ), remains a good esti- 
mator for the total divergence time, kt, as well as to the 
total distance of points in A t to id, namely g^ n {id). To 
answer this question we use the fact that the speed of 
escape of the r.w. up to some particular time, is the same 
from any point of the space and is close to 1, the maxi- 
mum value. Berestycki and Durrett studied speed of 
transposition and reversal random walks with the related 
edit distances while in the latter they used "approximate 
reversal distance" instead of reversal itself, ignoring the 
effect of hurdles and fortresses. This turns out to be the 
same as DCJ distance on single chromosomes. We have 

dr[7t, I) = n + 1 - c(tt) + h{7i) +/(tt) (9) 



Jamshidpey and Sankoff BMC Bioinformatics 2013, 14(Suppl 1 5):S7 
http://www.biomedcentral.eom/1 471 -2 1 05/1 4/S1 5/S7 



Page 3 of 6 



while 



dq{n, I) = n + 1 — c{tt), 



(10) 



where h{n) and /(tt) are the number of hurdles and 
fortresses, respectively. 

Although Berestycki and Durrett only proved their 
theorem for the random transposition r.w. on S n , they 
suggested that same method should carry over to rever- 
sal r.w. The following proposition is proved in [13] for 
approximate reversal distance (i.e., DCJ distance). 

In this result and in the ensuing discussion a n is an 
arbitrary sequence such that a n — > °° as n — > 0. When it 
is unambiguous we drop n from A;"' and X". 

Propostition 1 [Berestycki-Durrett] Let c be fixed and 
let Xt be a reversal r.w. on starting at id. Then 



(11) 



(12) 



dq{id, X cn/2 ) = (1 -f{c))n + w{n), 
where 

k=l 

and —> 0 in probability. 

Remark 1 The function 1 - f is linear for c <l,f(c) = 
1 - c/2, and sublinear for c >1, 1 -/(c) < c/2 This 
means that for c < 1 



CI I 



dcj(id, X cn/2 ) - — = w{n) 



(13) 



and r.w. travels on an approximate geodesic (or parsi- 
monious path) asymptotically almost surely, f is the func- 
tion counting the number of tree components of an 
Erdos-Renyi random graph with n vertices for which the 
probability of having each edge is ^, denoted by G(c, n). 
See Theorem 12 in [14], Chapter V. 

We extend the above theorem for the bonafide reversal 
distance. To do so we need to estimate the number of hur- 
dles of • Recall that an oriented cycle in a breakpoint 
graph is a cycle including an orientation edge, that is a grey 
edge with two black adjacency edges e, e , where a reversal 
involving e and e splits the cycle [15]. As we discussed this 
is equivalent to saying e Jf e. It is not difficult to show 

Lemma 1 Let C e cBP(n), then C is oriented if and only 
if there exists exactly two equivalence classes of black edges, 
that is there exist at least two black edges with different 
signs. 

Then 

Theorem 1 Let c >0 be fixed and let X t be a reversal 
r.w.starting at id. Define h t := h(X t ) to be the number of 
hurdles in BP {X t ). Then 



«cn/2 

a n ~Jn 



0 in probability. 



(14) 



Proof. Cycles of the BP that have never been involved 
in a fragmentation event must be oriented, since the 
two rejoined black edges resulting from an inversion- 
induced merger of cycles cannot be parallel. 

Therefore we need only to count the number of edges 
that have been involved in a fragmentation event. To do 
so we apply the method of counting cycles in [13], The- 
orem 3. Hurdles occur only in those cycles with length 
more than one that have been involved in a fragmenta- 
tion up to time y. We call such cycles fragmented 
cycles. The number of fragmented cycles with length 
more than ^fn is always less than ^fn. But to count all 
fragmented cycles in ^y with size less than ^/n we need 
to find an upper bound for the rate of a fragmentation 
up to time y. Since a fragmentation occurs when two 
black edges in one cycle are chosen, to fragment a cycle 
in BP, for any chosen black edge e we only can pick 
another black edge e' in the same cycle whose graph dis- 
tance in the breakpoint graph is less than lyfn. (The 
coefficient 2 arises from the fact that the cycles are 
alternating in BP.) 

Thus the rate of fragmentation at an arbitrary time t is 
not more than ~ • - -2= Integrating up to time t, 

this gives us the expected number of fragmented cycles 
at time t is For t = y this expectation is c^Jn. Now, 
dividing by a n ^/n, the result follows from Chebyshev's 
inequality and the fact that hurdles only occurs in frag- 
mented cycles. ■ 

Theorem 2 let c >0 be fixed and let X t be a reversal r.w. 
on S^ starting at id and let A ■= d[ n ^ denote the reversal 
distance on St. Then 



d r (id, X c „i 2 ) = (1 -f{c))n + w'{n) 



(15) 



where f is the same function as in the statement of Pro- 
position 1 and w' («) is a function with 0 in 
probability. 

Proof. Since d r (U) = dcj(U) + h(Tl) +/ (FT) by the pro- 
position we have d r (X cn/2 ) = (1 - f{c))n + w{n) + h cn/2 + 
f(X cn/2 ). But 



u/(n) w{n) + hcn/2 + f {Xcn/2) 



0 



(16) 



a n ^/n a n ^/n 

in probability, by the convergence of and in 

Proposition 1 and Theorem 2 and j (X cn j 2 ) < XM 

Theorem 3 Let x\' n , . . . , x\' n be k independent rever- 
sal r.w in starting at id. Suppose either 

a) d := dej dej distance 
or 

b) ft ■= av 1 ) reversal distance. 

1 c^' n 

Then for c <j we have -^2= — > 0 in probability. 

4 <i„Vn 
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Proof. We prove the theorem only for d r . The proof 
of the DCJ case is similar. For all i, j e {1, k} and for 
a median solution x of a["' 



df\x\\ x'f) < dT\x, x'/ n ) + dr{x, x>;"). 

Therefore, 



(")< 



(17) 



£4 n) (4 n < 4") < E^ n> (*< *h +4 n) (*, 4" )). (18) 

We conclude 

E W< xf ) < (fe- l)m d '"(A; n) ) < (fe-l)s A „(ui).(19) 
Let c < Then by Theorem 2 we have for all /, / i * /' 
df\x l », X ] il)=1cn-w{n) (20) 



and 



d\ n \id, X^) = cn-w{n) 



(21) 



where 



^ a ^ r 0 in probability. Thus 

(2c« - u/(n)) < (fe- l)**"^') < (fe - l)fc(en - w(n)). (22) 

Then 



|m d *"(A™) - M < fe'w(n) 



(23) 



for a constant Also \g A w{id) - kcn\ <kw{n). 
Therefore, there exists a constant k* such that 



\m d - n {Ai: ] )-g%{id)\ <k*w{n). 



(24) 



This implies 



. m d -"{A%>)-g%{id) 



a n ^fn 



a n *Jn 



0 in probability. ( 2 ^) 



This proves the theorem. ■ 

Remark 2 77ze statement of the theorem suggests 
ignoring the error of order o{a n ^fn) for a n — > °°. id 
remains as the median of leaves of k independent sto- 
chastic processes x\' n , . . . , x\' n up to time | asymptoti- 
cally almost surely. 

Theorem 4 Let c < \be fixed. Suppose d is either DCJ 
or reversal distance. Then by the hypothesis of Theorem 3 



kcn-m d ' n {A cn ) 
a n ~Jn 



0 in probability asn oo. (26) 



Proof. This follows directly from the fact that 

ken-g^jid) p 
a ny /n 

in probability. ■ 



(27) 



Now, it is natural to ask whether the statement of 
Theorem 4 also holds for some time after \. In other 
words, is the median value ken a fair estimator for the 
total time of divergence? We conjecture not, that the 
property is lost after time |, but for now can only prove 
a weaker upper bound for this time. 

Theorem 5 Let c > \ be fixed. Suppose d is either DC] 
or reversal distance. Then by the same hypothesis as in 
Theorem 3 



kcn-m d ' n {A cn ) 
n 

where 
a c := fe(l - /(2c)) 



(28) 



(29) 



is strictly positive for c > j 

Remark 3 This theorem shows after time § the error is 
of order n and so the median value is not a good esti- 
mate of k times the divergence time. 

Proof. 

hen - m d '"{A cn ) > ken - g^(ui) = fc(l -/(2c))n + w(n), (30) 

where — > 0 in probability. Dividing by n, the 
result follows. ■ 

In fact, since / (c), c >0 is decreasing and for c <1, 
/(c) = 1 — ~, it is easy to see that in the case k = 3, for c 
>0.75, e*" is of order f}fn for some fif > 0. 

Theorem 6 Let k = 3 and d be either dej or dr. Con- 
sider the same hypothesis in Theorem 3. Assume c* be 
solution of 



/(-) = - 



Then for all c > c* there exists fifsuch that 



(31) 



(32) 



e d £ = o{fi d n)- 
4 

Proof. 



m d '\Acn) < d{X\Z, X 2 cn n ) + d[xW, X 3 c ' n n ). (33) 

4 — — — — 



4 

Computing d{X^ , X^) for i = 2, 3 is the same as 

In 4 4 

d[id, X^n ). This is true since the Cayley graph of 

w.r.t. reversals is symmetric and regular and so 
P{X 0 = id, Xcn = n) = P(X 0 = n, Xcn = id)^ g u t there- 

4 4 

fore by symmetry of the Cayley graph we can just con- 
sider d{id, X^_ ). Hence, 

2 



m din {Acn) < 2(1 -f(c))n + 2w[n). 

4 



(34) 
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Let x >0 be so that 

0< -2(1 -/(*)) + 3(1 -/(£)). 
This means 

g d A n cn [id) > m d ' n {Acn). 



(35) 



(36) 



So it suffices to prove above inequality for x = c >c . 
Since f (x) >0 for all x >0 



l + 2/W-3/(|)>l-3/(|) 



(37) 



in which the right hand side is strictly increasing, 
Therefore for all c > c* 



l + 2/( C )-3/(|)>l-3/(-) = 0. 



(38) 



This proves the statement. ■ 

Now, we would like to measure the volume of that 
part of the space for which median does well, com- 
pared with the whole space. The ratio of the two con- 
verges to 0 as n goes to °°, showing that the median is 
only useful in a highly restricted region of the space.. 
The following theorem is entailed by a theorem in [16]. 
Let c n = c„(Il) be the number of cycles in the BP graph 
of a random n e S^. Let d„ be a distance (metric) on 
St Define 



B a cn = Bf n n := {n e S* d{Tl, id) < cn] 

to be the ball of radius cn in S*. 
Theorem 7 Let 0 < c < 1 be fixed. Then 



(39) 



a)y n = 



Iggj 

\si\ 



0 as n — > oo, 



7i / l^cnl ,, 
^JKn = — — > 0 as n — >• oo. 



(40) 



(41) 



Proof. 

a) 



For all n e , |cBP(n)| > (1 - c)n. 



(42) 



Suppose y„ does not converge to 0. Therefore there 
exists a subsequence {n,}i e N such that Yn t > £ for a con- 
stant e >0. This implies 



£(c nj ) > e(l — c)n;. 
But by Theorem 2.2 in [16], we have 

> 0 as Hv — >■ oo. 



(43) 



(44) 



That is in contradiction with the above inequality 
since 



e(l — c)rii 



e(l - c) > 0. 



(45) 



b) For the second part it suffices to observe that for all 
n e St we have 



d r {n) > dcj{U). 
Therefore 

B d ' (n) C B dci {Y\) 
and the result follows part (a) since 

y' n < Yn — > 0 as n — >■ oo. 



(46) 



(47) 



(48) 



Conclusion 

We have shown that the median value for DCJ and for 
reversal distance for a reversal r.w.has good limiting 
properties if the number of steps remains below cw/4, 
for any c <1, but for some value c > 1, more than this 
number of steps destroys these limiting properties. The 
critical value may indeed be c = 1, but for now we can 
only show that for c > 3 (and c > 2) the median value is 
no longer a good estimator of the distance between the 
id and the current position of the r.w. (and k times the 
divergence time, respectively). 

Note that a simulation strategy to estimate c is not 
available because of the hardness of calculating the med- 
ian. As n increases even to moderate values all exact 
methods require prohibitive computing time. 

These results imply that the steinerization strategy for 
the small phylogeny problem may lead to poor estimates 
of the interior nodes of a phylogeny unless the taxon 
sampling is sufficient to assure that a "similar genomes 
condition" holds for every k-tuple of genomes used in 
the course of of the iterative optimization search. This 
can be monitored prior to each step in the iterative 
optimization of the phylogeny through successive appli- 
cation of the median method. 
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